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1 Introduction 

This report describes experiences with implementing the NAS Computational Fluid Dynamics 
benchmarks using a parallel object-oriented language, Charm+ + . Our main objective in imple- 
menting the NAS CFD kernel benchmarks was to develop a code that could be used to easily 
experiment with different domain decomposition strategies and dynamic load balancing. We also 
wished to leverage the object-orientation provided by the Charm+-j- parallel object-oriented lan- 
guage [7, 8], to develop reusable abstractions that would simplify the process of developing parallel 
applications. 

We first describe the Charm++ parallel programming model and the parallel object array 
abstraction, then go into detail about each of the Scalar Pentadiagonal (SP) and Lower/Upper Tri- 
angular (LIT) benchmarks, along with performance results. Finally we conclude with an evaluation 
of the methodology used. 

2 The Charm+-|- parallel object-oriented programming model 

This work is based on the parallel object-oriented language Charm++[ 7, 8], an extension of C++. 
Charm++ is an explicitly parallel language, whose parallel constructs are modeled after the Charm 
parallel programming system [4]. Its innovative features include message driven execution for la- 
tency tolerance and modularity, dynamic creation and load balancing oi concurrent objects, branched 
objects which have a representative on every processor, and multiple specific information sharing ab- 
stractions. This section describes the essential features, syntax, and implementation of Charm++. 

Charm++ was designed to address the issues of portability, need to deal with communication 
latencies, support for irregular and dynamic computation structures, and reuse of parallel software 
modules. 

2.1 Message Driven Execution 

Charm++ uses message driven execution to overcome the problem of communication latency. In 
message driven execution, computation is initiated in response to the availability of a message. 
In Charm++, messages are directed to a method inside an object. Messages received from the 
network are kept in a queue, from which the system scheduler picks a message, and invokes the 

specified method within the object at which the incoming message is directed. 
Pltifh. 
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Message-driven execution, combined with an asynchronous (non-blocking) model of communi- 
cation, exhibits latency tolerance by overlapping computation and communication adaptively and 
automatically. Each processor typically has multiple objects waiting to be scheduled based on 
availability of messages directed at them. A remote operation (such as fetching remote data), is 
initiated by an object by sending a message asynchronously to an object on the remote processor 
and returning control to the runtime system. The runtime system schedules pending computations 
in any other objects on the processor. When the remote data finally arrives in the form of a mes- 
sage, the runtime system can schedule the requesting object again. Multiple remote operations 
could be initiated by a single object and could be processed in the order these operations fin- 
ish. Thus message-driven execution has several advantages over the traditional “blocking-receive” 
based communication, offering better performance through adaptive scheduling of computations. 
Message-driven execution also helps to promote modularity and reuse in parallel programs without 
losing efficiency, by allowing the overlap of computations across modules. 

2.2 Dynamic Object Creation: chares and messages 

In order to support irregular computations in which the amount of work on a processor changes 
dynamically and unpredictably, Charm++ allows dynamic creation of parallel objects (chares), 
which can then be mapped to different processors to balance loads. A chare is identified by a 
handle , which is a global pointer. 

Chares communicate using messages. Sending a message to an object corresponds to an asyn- 
chronous method invocation. Message definitions have the form : 

message class MessageType { 

// List of data and function members as in C++ 

}; 

Chare definitions have the form 

chare class ChareType { 

// Data and member functions as in C++. 

// One or more entry functions of the form : 

entry: 

void FunctionName(MessageType *MsgPointer) 

{ C++ code block } 

} ; 

The entry function definition specifies code that is executed atomically when a message is 
received and scheduled for processing. Only one message per chare is executed at a time. Thus 
a chare object defines a boundary between sequential and parallel execution : actions within a 
chare are sequential, while those across chares may happen in parallel. Entry functions are public 
object methods with message as a parameter and no return value. The handle of a chare is of 
type “ChareType handle”, and is unique across all processors. While multiple inheritance, dynamic 
binding, and overloading are supported for sequential objects by C + + , Charm++ extends these 
concepts for chares (concurrent objects), thus permitting inheritance hierarchies of chare classes. 

Every Charm++ program must have a chare type named main, which must have the function 
main. There can be only one instance of the main chare type, which usually executes on processor 
0. Execution of a Charm++ program begins with the system creating an instance of the main 
chare and invoking its main function. Typically, this function is used by the programmer to create 
chares and branched chares and initialize shared objects. 
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Chares are created using the operator newchare, similar to the new in C++ : 
newchare ChareType(MsgPointer) , where ChareType is the name of a chare class. This operator 
deposits the seed for a new chare in a pool of seeds and returns immediately. Later, the runtime 
system will actually create the chare on some processor, as determined by the dynamic load balanc- 
ing strategy. When the chare is created, it is initialized by executing its constructor entry function 
with the message contained in MsgPointer as parameter. The user can also specify a processor 
number as an optional argument to create the chare on specific processor, thereby overriding the 
dynamic load balancing strategy. A chare can obtain its own handle once it has been created and 
pass it to other objects in messages. 

Messages are allocated using the C++ new operator. Messages can be sent to chares using the 
notation ChareHandle=>EF(MsgPointer) 1 This sends the message pointed to by MsgPointer to the 
chare having handle ChareHandle at the entry function EF, which must be a valid entry function 
of that chare type. 

2.3 Dynamic load balancing 

Charm++ supports dynamic object creation with dynamic load balancing libraries which help to 
map newly created chares to processors so that the work is balanced. Since the patterns of object 
creation vary widely among applications, and the characteristics of the underlying parallel machine 
also vary, different dynamic load balancing strategies become suitable in different circumstances. 
Charm++ provides many generic libraries for dynamic load balancing, which can be selected by the 
user at link time, depending on the requirements of the application. These libraries are implemented 
as modules on top of the basic runtime system. 

2.4 Branched chares 

A branched chare is a group of chares with a single name. A branched chare has one representative 
(branch) chare on each processor, and has a single global handle. One can asynchronously invoke 
a method on (i.e. send a message to) a representative of a branched chare by specifying its handle 
as well as the processor, in addition, one can synchronously (i.e. just like a sequential function 
call) invoke a method within the local representative of a branched chare. 

Branched chares can be used to implement distributed services such as distributed data struc- 
tures, global operations and high level information sharing abstractions, thereby encapsulating 
concurrency. They can be used for static load-balancing in object-parallel computations (each 
representative performs the same computation on the data owned by it). They also provide a 
convenient mechanism for distributed data exchange between modules : the representatives of a 
branched chare in one module hand over the data to the representatives in the other module on 
their own processors, without the need for centralized transfer. Finally, branched chares can also be 
used to encapsulate processor-specific information. (Indeed branched chares are used to implement 
many dynamic load balancing strategies.) 

It is important to underscore that branched chares are objects. In particular, there may be 
multiple instances of the same branched chare class. So, simple local function calls do not provide 
the same service as the invocation of a method in a local representative branch. 

Branched chares are also created with the newchare operator : newchare ChareType (MsgPointer) . 
This causes the runtime system to create a branch on every processor and initialize it by executing 

1 (Nol.e the syntactic difference between asynchronous message sending and sequential method invocation as in 
CH — K) 
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its constructor entry function. Branched chares are usually created in the main function of the 
main chare, in which case this operator returns the handle of the newly created branched chare. 
ChareHandle[LOCAL]->DataMember and ChareHandle[LOCAL]->FunctionMember() are used to access 
public members of the local branch of a branched chare. 

ChareHandle [P] =>EF(MsgPointer ) sends a message to the function EF in the branch of a branched 
chare on the processor P. 

ChareHandle [ALL] =>EF(MsgPointer) results in a message being broadcast to all branches of a branched 
chare (i.e. to all processors). 

2.5 Specific information sharing abstractions 

Charm++ provides specific abstract object types for sharing information. Each abstraction for 
information sharing may be thought of as a template of an abstract object, with methods whose code 
is to be provided by the user. These shared objects have a global handle (name), and can be accessed 
on all processors, but only through their specific methods. These abstractions may be implemented 
differently on different architectures by the Charm++ runtime system, for efficiency. Some of the 
abstractions provided by Charm++ are: read-only variables, distributed tables, accumulators, and 
monotonic variables. Additional abstractions may be added as libraries, as the need for them arises. 

2.6 Other CharmH — b features 

Prioritized Execution : Charm++ provides many strategies that the user can select for manag- 
ing queues of messages waiting to be processed. Some of them (FIFO, LIFO, etc) are based solely 
on the temporal order of arrival of messages. However, in many applications (such a s algorithms 
with a critical path, search-based algorithms, and discrete event simulations), it is necessary to al- 
low the application to influence the order of processing of messages by assigning message priorities. 
Charm++ supports integer priorities as well as bit-vector priorities (with lexicographical com- 
parison of bit-vectors determining order of processing), which are especially useful for prioritizing 
combinatorial search algorithms. 

Conditional Message Packing : Charm++ allows arbitrarily complex data structures in mes- 
sages. On private memory systems, pointers are not valid across processors, hence it is necessary 
to copy (pack) the pointer-linked structure into a contiguous block of memory before sending the 
message. However, packing is wasteful if the message is sent to an object on the same processor, or 
on shared memory systems. To allow optimal performance in this context, for messages involving 
pointers, the user is required to specify the methods pack and unpack in the message class for pack- 
ing and unpacking messages that are called by the system just before sending and after receiving a 
message, respectively. Thus only messages that are actually sent to other processors are packed. 
Quiescence Detection : Since the Charm++ model provides independently executing parallel 
objects, there is no single global thread of control, hence detecting quiescence (termination) of 
a program is difficult. Charm++ provides a quiescence detection library for this purpose, which 
detects quiescence (when no object is executing any computation and all messages sent have been 
processed). The programmer may then choose to simply exit, or start the next phase of the parallel 
program. 

2 . 7 Implementation 

Charm+ + has been implemented as a translator and a runtime system. r lhe translator converts 
Charm + + constructs into C++ constructs and calls to the runtime system. The runtime system is 
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layered into a language independent portable layer Converse, on top of which is the Chare. Kernel 
layer. 

2.7.1 Converse : Portability and interoperability 

The Converse layer provides a portable machine interface which supports the essential parallel op- 
erations on MIMD machines. These includes synchronous and asynchronous sends and receives, 
global operations such as broadcast, atomic terminal I/O, and other advanced features. Some im- 
portant principles that guided the development of Converse include need-based cost (e.g. Charm++ 
should not need to pay the overhead of a tag-based receive mechanism provided by an underly- 
ing layer, since Charm++ uses message- driven execution; also, a system such as PVM should not 
have to pay the cost of prioritized scheduling that Charm++ needs), efficiency (the performance 
of programs developed on top of Converse should be comparable to native implementations) and 
component based design (the Converse layer is divided into components with well-defined interfaces 
and possibly multiple implementations which can be plugged in as required by higher layers). 

Converse is designed to help modules from different parallel programming paradigms to inter- 
operate in a single application. In addition to common components such as the portable machine- 
interface, it provides paradigm-specific components such as message managers and thread objects, 
that can be customized and used to implement individual language runtime layers. Converse sup- 
ports both SPMD style programs (which have no concurrency within a processor and explicit, static 
flow of control) as well as message-driven objects and threads (which have concurrency within a 
processor and implicit, adaptive scheduling). 

The Converse machine interface has been ported to most parallel machines. Languages imple- 
mented on the Converse framework include Charm, Charm-|--|-, PVM (messaging), threaded PVM, 
SM (a simple messaging layer), and DP (a data parallel language). 

2.7.2 Chare Kernel 

The Chare Kernel layer was developed originally to support Charm, but was modified to support 
CTT interfaces required for Charm-(-+ too. It implements various functions such as system initial- 
ization, chare creation, message processing (to identify the target object and deliver the message, 
to it), performance measurements, quiescence detection, etc. 

One important function of the Chare Kernel is to map parallel class and function names into 
consistent integer ids which can be passed to other processors. This is required because function 
and method pointers may not be identical across processors, especially in a heterogenous execution 
environment. The Charm++ translator cannot assign unique ids to classes and methods at compile 
time, because Charm++ supports separate compilation, and the translator does not know about 
the existence of other modules. Also, while passing ids for methods across processors, this mapping 
must be implemented so as to support inheritance and dynamic binding : when a sender sends a 
message to a chare C at an entry function E defined in C’s base class, C must call its own definition 
of E if it has been redefined, otherwise it must call its base class’ definition of E. 

To meet these requirements the Chare Kernel provides a function registration facility, which 
maintains the mapping from ids to pointers. The translator-generated code uses this registration 
facility during initialization at run-time to assign globally unique indices to chare and entry function 
names. This unique id can be passed in messages across modules. The translator also generates stub 
functions for every entry function in every chare class. When a message is received and scheduled 
for processing, the Chare Kernel uses this stub function to invoke the correct method in the correct 
chare object. For dynamic binding to work, the stub function invoked is the one corresponding 
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to the static type of the chare handle at the call site; the C++ virtual function mechanism then 
invokes the correct method depending on the actual type of the chare object. 

The Chare Kernel uses a scheduler (defined as a component of Converse) which is essentially a 
“pick and process” loop. It picks up incoming messages from the Converse message buffer, enqueues 
them by priority according to a user-selected queueing strategy, and then picks the highest priority 
message from the queue for processing. 

Finally, the Chare Kernel also manages chare handles (which are essentially global pointers), 
and does the mapping from local object pointers to chare handles and vice versa. Branched chare 
handles need to be managed slightly differently, since they have a single global handle for a group 
of chares : the Chare Kernel needs to ensure that a consistent handle is used on all processors. 


3 Parallel Array Abstraction 

Since the NAS parallel benchmarks involved computations on a three dimensional data space, the 
natural parallelization scheme was to divide these arrays in many smaller cubes and perform com- 
putations on these cubes in parallel while preserving data dependencies. In order to represent a 
multi-dimensional array in parallel, we developed a parallel object array abstraction for Charm++ 
[9]. This abstraction allows the programmer to create an array of parallel objects, map it to proces- 
sors according to the parallel algorithm requirements, send messages to selected elements, perform 
global operations such as multicasts, and specify new mappings for dynamic decompositions. 

A parallel array is a group of objects (the array elements) with a common global name (id), 
which are organized in a multidimensional, distributed array, with each array element identified 
by its coordinates. The mapping of array elements to processors is specified by a user-provided 
mapping function. A default mapping is also provided for cases when the mapping is not significant. 
The data space of the problem could be partitioned into contiguous blocks and could be assigned 
to parallel objects that are elements of the parallel array. 

3.1 Parallel Array Definition 

A parallel array is defined as a normal parallel object (chare) class in Charm++, except that it 
must inherit from the system-defined base class array. This base class provides the following data 
fields : 

• thishandle : this gives the unique handle (global pointer) of the array element. 

• thisgroup : this gives the global id by which the whole array is known. 

• thisi, thisj, thisk : these give the coordinates of the array element 2 . 

Messages that are sent between array elements must inherit from the system-defined message 
class arraymsg. The following code gives an example of an array definition. 


message class MessageType : public arraymsg { 
// list oi data fields to be sent 

} ; 


^Currently, only I, 2, or 3- dimensional arrays are supported, although this can he easily extended to higher 
dimensions. For brevity, all the examples in this section assume a 2-dimensional array. 
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chare class MyArray : public array { 

// list of private and public data and function members 
entry : 

// list of "entry functions" where messages are received 
MyArray (MessageType *ra) ; // constructor 

void EntryFunction(MessageType *m) ; 

} ; 

3.2 Parallel Array Creation 

A parallel array is created using the operator newgroup , which has the following syntax : 

MapFunctionType mymapfn ; 

MessageType *msgptr ; 

MyArray group arrayidl = newgroup MyArray [XSize] [Ysize] (rasgptr) ; 

MyArray group arrayid2 = newgroup (mymapfn) MyArray [XSize] [Ysize] (rasgptr) ; 

The code above creates two-dimensional parallel arrays with sizes XSize and YSize in X and Y 
dimensions. The newgroup operator causes all the array element objects to be created (and their 
constructors invoked) on their respective processors. The parameter msgptr is sent to all processors 
as the parameter to the constructor for each array element. The first array above uses the default 
mapping function. The second array has a user-specified mapping function mymapfn, which takes 
the coordinates of an element as input and returns the processor where the element is located, 
newgroup is a non-blocking operator that immediately returns the id of the newly created array, 
which has the type MyArray group, and is analogous to a global pointer to an array. Because 
of its non-blocking nature, the elements of an array might not have been created when newgroup 
returns the array id. If necessary, the programmer may explicitly synchronize after initialization 
of ail array elements on all processors by using a suitable reduction or synchronization operation. 
Currently, parallel arrays may be created only from processor 0. 

3.3 Asynchronous messaging: remote method invocation 

The parallel array library provides both point-to-point as well as multicast messaging. All messag- 
ing is asynchronous (no reply value is allowed), in keeping with the non- blocking communication 
paradigm of Charm++. If a reply is desired, the receiving object must send a reply message back 
to the sender object. 

The syntax for point-to-point asynchronous messaging is : 
arrayidfi] [j] =>EntryFunction(rasgptr) ; 

where arrayid is the “global pointer” to the parallel array, i,j are the coordinates of the recipient 
array element, EntryFunction is the function to be invoked in the receiving object, and msgptr is 
the message to be sent across, which is passed as the sole parameter to the function. 

The syntax for multicast asynchronous messaging is : 

arrayidfil. . i2] [jl. . j 2] =>EntryFunct ion (rasgptr) ; // multicast to sub-array 

arrayid[ALL] [j ] =>EntryFunction(msgptr) ; // multicast to column 

arrayidfi] [ALL] =>EntryFunction (rasgptr) ; // multicast to row 

arrayidfALL] [ALL] =>EntryFunction(rasgptr) ; // multicast to whole array 

If an array element is known to be on the local processor, its data and function members may 
be accessed as in sequential C++ : 

arrayidfi] [j] ->dataraember 
arrayidfi] [j] ->functionmeraber ( . . . ) 
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3.4 Remapping and migration 

The parallel array library supports both synchronous remapping and asynchronous object migra- 
tion. Synchronous remapping must be initiated from processor 0 as follows : 

arrayid->remap( (MapFunctionType)newraapfn , return.xhare handle, 

&(ReturnChareType : :ReturnFunction)) ; 

newmapfn is the new mapping function. All array elements will be moved from their original 
locations to their new locations as specified by the new mapping function. After all elements 
have been installed on their new locations, a message is sent to the function ReturnFunction in 
the chare object specified by return_chare_handle. This provides a synchronization point after 
remapping. The user program must ensure that no messages are sent to any elements of the array 
being re-mapped. 

Sometimes such synchronization is impossible or inefficient. Asynchronous remapping or “mi- 
gration” is activated by each array element independently, by calling the function 
migrate ( (MapFunctionType)newmapfn) 

on the array element to be moved. The newmapfn parameter specifies the new mapping function, 
which tells the run-time library the destination processor for the array element. The call results 
in only the specified object being moved to its destination processor. The run-time library will 
correctly forward messages directed to the migrating array element to its new location. 

The actual steps performed by the runtime system while migrating an object are : 

1. Before migrating an object, the runtime library calls a user-provided pack function on the 
object, which copies the object’s data area into a contiguous message buffer. The programmer 
must provide a pack function for every object type that needs migration. (In future, we plan 
to automatically generate such pack and unpack functions based on the interface specification 
for array elements.) 

2. Send the message to the object’s destination processor 

3. Create the new object 

4. Initialize the object’s data area using the message buffer. This is done by another user- 
provided unpack function. (Note: the pack and unpack functions are virtual functions defined 
in the base class array). 

5. Forward messages directed to the object from the old processor to the new processor. 

3.5 Implementation 

The parallel array library is implemented on top of the Converse interoperable run-time framework 
[6], The library can thus be used in conjunction with modules written in other programming systems 
such as PVM and MPI. Although the parallel array concepts we developed were implemented in 
the context of the Charm++ parallel object-oriented language, the essential features are language- 
independent. Currently we are in the process of modifying the Charm++ translator to translate 
the parallel array syntax into calls to C++ functions in the runtime library. The runtime library 
provides functions to create an array, send message to all elements of an array or to a subset of 
array, and various utility functions. 
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3.6 Typical Usage of Array Abstraction 

In this section we describe how to use the Array abstraction in the current form (that is, without 
the translator modifications supporting the syntax described in earlier section.) 

CreateArray function creates a parallel array of objects. The programmer needs to provide 
the CreateArray function with a mapping function. The mapping function takes group id and the 
coordinates as input parameters and returns a processor number on which the array element will 
be placed. An example of a simple mapping function is given below. 

int Grid3D(int gid, int x, int y, int z) 

{ 

return (x*/,NX + (y‘/ 4 NY)*NX + (z‘/.NZ)*NX*NY) ; 

} 

Where total number of processors is NX X NY X N Z . 

An array is typically created in the main function of the main chare. One needs to allocate a 
message and fill in the appropriate fields in the message that will be sent to each element of the 
array for initialization. An example of array creation is given below: 

msgptr = new CreateMessage ; 
msgptr->m = thishandle; 
msgptr->dt = dt; 
rasgptr->omega = omega; 
rasgptr->itmax = itmax; 

//cubearray = CreateArray (Chare (cube) ,EP( cube , cube) ,rasgptr, 

Grid3D ,ncx,ncy ,ncz) ; 

cubearray = CreateArray (_CK_chare_cube, __CK_ep_cube_cube ,rasgptr, 

Grid3D ,ncx ,ncy ,ncz) ; 

Here, a message of type CreateMessage is being sent to the newly created array of cube chares at 
the constructor entry function of each chare. The name mangling will be handled by the translator 
once it supports the array abstraction. However, currently the programmer needs to take care of it. 
(One can use macros Chare and EP to achieve this as shown in comments in the above example.) 
Chare names are translated to integers of the form _CK_char e.charename and the entry functions 
are translated as _CK_ep -chnrcname-entryname. ncx, ncy, and ncz are the sizes of the array in X, 
Y and Z directions respectively. 

The synchronization requirement after the creation of array demands that the newly created 
chares in the array perform a global reduction. This synchronization code needs to be provided by 
the programmer. Typically, this could be achieved by each chare sending a message to the main 
chare and awaiting a message from the main chare to trigger computation. The main chare keeps 
track of how' many synchronization messages it receives and then sends a. message to all the elements 
of the array to start computation. Messages could be sent to a particular element of an array using 
SendArray function or multicast to the entire array (or its subset) using SendArrayRange function. 


4 The NAS Scalar Pentadiagonal (SP) benchmark 

The NAS Scalar Pentadiagonal (SP) benchmark [1] is one of three simulated Computational Fluid 
Dynamics benchmarks in the NAS benchmark suite. It is intended to represent the principal 
computation and communication requirements of CFD applications in use today. 

The SP benchmark involves the solution of multiple independent systems of scalar pentadiagonal 
equations which are not diagonally dominant. The computational space is a three-dimensional 
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structured mesh consisting of 64 x 64 x 64 grid points. The method used is an iterative Alternating 
Direction Implicit (ADI) method. In each iteration there are three “sweeps” successively along 
each of the three coordinate axes. Thus the method involves global spatial data dependences. 

Our main objective in implementing the NAS SP benchmark was to develop a code that could 
be used to easily experiment with different domain decomposition strategies. We also wished to 
leverage the object-orientation provided by the Charm++ parallel object-oriented language [7, 8], 
to develop reusable abstractions that would simplify the process of developing parallel applications. 

4.1 Parallelization schemes 

The steps in the the numerical algorithm [3] which are significant for parallelization are : 

• Computation of the RIIS vector of the partial differential equation. Each grid point in the 
cubical mesh needs values of the U matrix from two neighboring grid points on either side, 
in each of the three dimensions. This corresponds to six ’’parallel-shift” operations. 

• Solution of a system of linear equations in the ^-direction. Each grid point initially needs 
values from two succeeding grid points in the ^-direction (corresponding to a shift operation 
in the negative- £ direction). Then there is a sweep along the positive - x direction in which 
each grid point computes values that are needed by the next two points. 

• Solution of a system of linear equations in the ^/-direction. This is similar to the previous 
step, except that communication is along the ^-direction. 

• Solution of a system of linear equations in the 2 -direction. This is similar to the previous 
step, except that communication is along the 2 -direction. 

Parallelizing these steps requires decomposition of the three-dimensional computational array 
among processors. This decomposition must be done so as to balance computational load across 
processors as well as reduce inter-processor data communication. 

Three of the most common methods used to parallelize ADI methods are [11] : 

• Pipelined static block decomposition : each processor is statically allocated a contiguous three- 
dimensional block of grid points for the entire length of the computation. The block is made as 
close to cubical as possible to minimize the amount of communication (which is proportional 
to surface-area of the block). During the sweeps, each processor receives boundary data from 
the previous processor in the sweep direction, computes its data, and sends its boundary data 
on to the next processor. In order to reduce idle times while processors wait for data from 
previous processors, the computation is pipelined : each processor works on a slice of its grid 
points, sends the resulting boundary on to the next processor, and then goes on to the next 
slice. The disadvantage of this decomposition is that many processors idle at the beginning 
and end of the sweeps; moreover, there are many small messages sent between processors 
corresponding to the boundary data for each slice, which could cause significant overhead on 
machines with large message latencies. 

• Trans pose- based dynamic block decomposition : the three-dimensional mesh is divided into 
slabs oriented along the X direction first. After the X-direction sweep completes a transpose 
operation is done to orient the slabs along the Y-direction, in preparation for the Y-sweep. 
Finally, a third transpose operation is needed before the X-sweep of the next iteration. Thus 
there are a total of three transpose operations needed per iteration. The advantage of this 
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method is that computations within each sweep are completely local to a processor. However, 
the transpose operations between sweeps can result in significant overhead on bandwidth- 
limited machines. 

• The multi-partition or Bruno-Capello decomposition [10, 2] : this is a static decomposition 
where the computational mesh is divided into cubes, and each cube is assigned to a processor 
such that all processors are active at all stages in each of the three sweeps. In other words, each 
coordinate plane in the computational space contains cubes on all processors. Thus processor 
loads are balanced during all stages of all sweeps, and also no transpose operations are needed. 
The minimum number of cubes needed for this decomposition is P 3 ^ 2 (where P is the number 
of processors), so that each processor has y/P cubes. The cube with coordinate k) is 
allocated to processor ( i — k)%s + s({j — k)%s) + 1, where s = y/P and 1 <= i, j,k <= s. 
The tradeoff in the multi-partition method is that computations within a sweep involve cross- 
processor messaging. 

4.2 Implementing the SP benchmark using parallel arrays 

We developed the following abstraction for the NAS SP benchmark code : The computational 
space is represented as a three-dimensional array of cubical sub-spaces. Each cube is represented 
by a parallel object in Charm++, which communicates with other cubes by sending and receiving 
messages. Thus the parallel program consists of a network of communicating objects. 

The different decomposition/mapping strategies are expressed by simply specifying a different 
mapping function for the parallel array. E.g. the mapping function for the multipartition (Bruno- 
Capello) decomposition is : 

int Multipart it ionMapFn(int arrayid, int i, int j, int k) 

{ // return processor number owning object (i,j,k) 

return ( XArraySize*( ( i-k+XArraySize )'/,XArraySize) + 

( j-k+YArraySize)y,YArraySize ) ; 

} 

For the transpose method, all adjacent cubes along the direction of the sweep are mapped to the 
same processor. E.g. the mapping function for the sweep along the X-axis is : 

int XSweepMapFn(int arrayid, int i, int j, int k) 

{ 

return ( ZArraySize * j + k ) ; 

} 

The transpose is effected by simply doing a remap operation on the parallel array between sweeps, 
with the mapping function corresponding to the orientation of the next sweep. Thus we have a very 
flexible, elegant code which allows us to concentrate on experiments with the application, instead 
of getting involved in the details of implementing the decomposition. 

The asynchronous migration facility provided with parallel arrays allows us to further optimize 
the transpose method by overlapping communication and computation. Each cube object migrates 
itself as soon as it has completed its work along one sweep. Thus the communication overhead 
of transferring its data to another processor is overlapped with the computation performed by 
other cubes. This overlap gives significant performance advantages over the traditional loosely- 
synchronous (separate phases of computation and communication) implementations. 
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4.3 Performance results 


Table 1 presents performance results for the different decompositions in the Charm+ + implemen- 
tation of the NAS SP benchmark. Sync-Transpose is the transpose-based dynamic block decom- 
position. Async-Transpose is the dynamic block decomposition with asynchronous migration of 
parallel objects for moving data between sweeps. Note that the only modification needed to change 
the decompositions for the different runs was to change the mapping function. 


Processors 

4 

16 

64 

256 

Sync-Transpose 

- 

8.08 

3.01 

1.98 

Async-Transpose 

- 

7.81 

2.54 

1.40 

Multipartition 

24.63 

7.60 

1.98 

1.00 


Table 1: Time (in milliseconds) for different decompositions for the NAS SP benchmark (size A) 
on the Intel Paragon. 

The results show that the multipartition (Bruno-Capello) decomposition is the best overall, with 
the Async-Transpose and Sync-Transpose decompositions being successively worse. The absolute 
performance of our program does not compare well with performance numbers quoted by vendors 
for the NAS benchmarks. This is mainly because our focus was on flexible parallelization issues, 
and not on tuning the algorithm or code for sequential or absolute performance. 

5 The NAS Lower/Upper Triangular (LU) benchmark 

The LU benchmark is one of the three CFD kernels in NAS benchmarks. It solves a regular-sparse, 
block (5 X 5) lower and upper triangular system. This represents the computations associated 
with the implicit operator of a newer class of implicit CFD algorithms, and has a lower degree 
of parallelism compared with other benchmarks in this suite. All the data-dependencies in this 
benchmark are local (nearest neighbor.) The system of linear equations obtained by replacing the 
spatial derivatives by second-order accurate, central finite difference operators is solved using the 
symmetric successive over- relaxation scheme. Each iteration of this algorithm consists of 

• Computing the Right Hand Side explicitly. 

• Forming and solving the regular, sparse, block lower triangular system. 

• Forming and solving the regular, sparse, block upper triangular system. 

• Updating the solution. 

We used a spatial domain decomposition strategy to split the computational domain into a 
number of cubes. Different communication patterns develop as a result of data dependences in 
the above steps in each iteration. Computing RIIS explicitly in the first step requires domain 
boundaries to be communicated between neighboring cubes (in both positive and negative X, Y 
and Z directions.) In the second step, forming the lower triangular system is a completely local 
operation and does not need any data from neighbors. However, solving the system requires wave- 
like communication pattern in the direction of diagonal of the computation domain. This is a 
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result of data-dependence where an element (i, k) needs elements (i,j — 1, A: — I),(i — 1 — 

1) and (i - 1 ,j - 1 ,k). Computation in step 3 requires a wave-like communication similar to 
step 2 but in opposite direction. Thus, data-dependence mandates that there be three different 
communication patterns in each iteration. Also, the local data-dependence requires that these steps 
cannot be executed parallelly within the same iteration. This reduces the degree of parallelism in 
this benchmark significantly. Figure 1 shows that the available degree of parallelism in our algorithm 
in initial stages of the wave is very small. In the middle stages, where the wave reaches the principal 
diagonal of the 3-D data space, the degree of parallelism is high and then it reduces again in the 
later stages of the wave. 



Stage Number 

Figure 1: Degree of Parallelism in LU 

In algorithms such as these, the only sources of enhancing performance are proper scheduling 
of work and overlapping communication and computation. One of the advantages of programming 
this application in a message- driven language such as Charm+ + is that, the programmer only 
has to code the data-dependencies and leave the scheduling and overlapping communication and 
computation to the run-time system. 

The implementation of this benchmark using the parallel array abstraction in Charm++ for 
domain decomposition is similar to the implementation of SP benchmark described in earlier section. 
This substantiates one of our main claims in this work that, using parallel object-oriented languages 
such as Charm++, one could develop reusable abstractions which could be used in development of 
several applications. However, the approach taken to develop the parallel code for LU differs from 
that of SP. In developing LU benchmark, we first converted the implementation of sequential LU 
algorithm from FORTRAN to C+ + , forming private methods for individual cubes in the process. 
Using the array abstraction to parallelize the C++ code was a very trivial task then. One of the 
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main advantage of this approach was that we were able to eliminate errors due to base-language 
variation early in the process using tools for sequential programs, which are more advanced than 
their counterparts in parallel programming. 

We observed that the LU benchmark showed dependence to some extent on the actual place- 
ment of the cubes and the results of our experiments with different placement strategies reflect 
this observation. Our use of the parallel array abstraction in Charm++ allowed us to efficiently 
experiment with different placement strategies as well as different domain decomposition strategies. 

One of the placement strategies we used is shown in figure 2. For simplicity, we have shown 
a wave-parallel placement pattern for a 2-dimensional 4x4 grid, placed using this strategy on 
4 processors. The advantage of this placement strategy is that as the wave advances from one 
diagonal to the another, all the objects executing methods concurrently are placed on different 
processors. Therefore, we utilize all the processors optimally. 
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Figure 2: Wave- Parallel Placement Strategy 


5.1 Performance Results 

Our experiments were carried out on IBM-SP at Argonne National Laboratories. Charm++ is 
implemented on top of the native MPL communication library on IBM-SP systems. We conducted 
our experiments using 4 different decomposition strategies with 4, 8 and 16 processors and four 
different placement strategies (using different mapping functions during array creation.) The times 
are given for 25 iteration in seconds (The complete benchmark requires 250 iterations.) The different 
decompositions indicate the number of divisions of the computational domain in each direction. 
For example, a 8 X 8 X 1 decomposition means the computational domain was split into 8 parts in 
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X and Y direction, but was left untouched in the Z direction, thus forming 64 cubes. Our results 
are comparable to numbers reported in [3] considering the experience of other users of IBM-SP at 
Argonne National Labs that it is slower than other installations of IBM-SP by almost factor of 2 
for most programs. Also, we have not optimized the sequential part of computations that are coded 
in C H — P rather than in FORTRAN which has a better set of optimization tools available for such 
scientific computations. 


Processors 

4x4x1 

4x4x4 

8x8x1 

8x8x8 

4 

370.580 

308.072 

322.450 

- 

8 

231.840 

151.379 

179.884 

413.638 

16 

210.931 

96.273 

119.704 

81.930 


Table 2: Time (in seconds) for different decompositions for the NAS LU benchmark (size A) on 
the IBM SP using Wave-Parallel Mapping function. 


Processors 

4x4x1 

4x4x4 

8x8x1 

8x8x8 

4 

453.809 

298.159 

300.545 

- 

8 

430.003 

289.487 

220.567 

343.630 

16 

423.968 

289.297 

219.503 

320.542 


Table 3: Time (in seconds) for different decompositions for the NAS LU benchmark (size A) on 
the IBM SP using 10 Grid Mapping function. 


Processors 

4x4x1 

4x4x4 

8x8x1 

8x8x8 

4 

389.677 

267.771 

296.191 

- 

8 

237.368 

171.202 

182.077 

335.665 

16 

208.387 

131.432 

137.044 

66.602 


Table 4: Time (in seconds) for different decompositions for the NAS LIJ benchmark (size A) on 
the IBM SP using 2D Grid Mapping function. 


5.2 Performance Analysis of LU 

This section presents our work on the performance analysis of the LU code performed using a 
performance analysis tool for Charm and Charm++ programs, called Projections. Projections 
is available as a trace generation facility for Chare kernel, the run-time system of Charm and 
Charm-1-1- languages; and as a performance visualization and analysis tool. It includes an expert 
system that works with the trace data generated by the program and analyzes for critical paths, 
phases and degree of parallelism within the code. For enabling projections trace generations, a 
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Processors 

4x4x1 

4x4x4 

8x8x1 

8x8x8 

4 

384.666 

287.692 

297.734 

- 

8 

358.427 

163.656 

294.130 

255.330 

L6 

233.877 

108.904 

185.241 

65.016 


Table 5: Time (in seconds) for different decompositions for the NAS LU benchmark (size A) on 
the IBM SP using 3D-Grid Mapping function. 

Charm++ program should be linked by specifying -execmode projections on the Charm linker 
command line. When this program is run, it produces trace files, one per processor. These files are 
then used as input for the X-windows based Projections visualizations tool. This section presents 
some of the analysis we did using Projections on the LU code. 

One of the main reasons for performance degradation of parallel programs is the improper load- 
balancing. Proper load-balancing is characterized by equal amount of computation on all nodes. 
We checked the performance of LU for the amount of processing on each node. The processing time 
on each processor is shown in figure 3. It is in the range of 48 to 56 percent of the total time on 
each processor. This busy time is calculated based on the entire run of the program that includes 
the Charm initialization time during which most processors are idle. However, during the SSOR 
iterations, the busy time was 70 to 85 percent. Therefore, we concluded that LU is properly load 
balanced. 

Though the total load across all nodes was determined to be similar, the regular structure of 
our application demanded that each type of computations should be distributed in equal amounts 
on all nodes for proper load balance. The main types of computations in LU are building the 
matrices (setiv), solving the lower and upper triangular systems (bits and buts respectively), 
and computing the RHS (rhs). We have made each of these computations into entry methods of the 
chare cube, therefore, determining the amount of each of these computations across all processors 
amounts to finding out how many times each of these entry methods were invoked on each chare. 
Figure 4 shows this in a graphical format. We can conclude from figure 4 that the individual 
computations were load balanced as well. 

Another reason for the performance degradation especially in the light of our finding of low busy 
time per processor is the overhead imposed by the run-time system in the form of message send- 
ing, message processing and internal copying etc. However, the log files generated by Projections 
indicated that a total of 4424 messages were processed for each iteration of LU by each processor. 
From table 2, each iteration of LU takes 3.24 seconds. Thus the average grainsize of computation 
during the iteration is 732.368 /^seconds. Another experiment was run on SP2, which calculated 
the overhead per message creation and processing. This involved running a simple pingpong pro- 
gram written in Charm++ that transferred messages back and forth between two processors. This 
experiment indicated that the average overhead per message processing was 126 /^seconds. This 
amounts to 17 % overhead per message. 

Next we viewed the aggregate work on all processors as a function of time. And noticed the 
distinct peaks for each of the iterations of LU. An iteration is characterized by forming RIIS, solving 
the lower triangular system and solving the upper triangular system. We noticed that the amount 
of work done is at its peak in the middle of an iteration. This was expected since the degree of 
parallelism is at its highest in a wave-parallel distribution when the wave reaches the diagonal of the 
cube. Thus we figured out that this dependence is the main cause of low performance. We analyzed 
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Figure 3: Busy Time Per Processor 
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the trace data using the expert analysis tool of projections. Projections performed a critical path 
analysis and showed that almost all the computations for each cube are on critical path and that 
the average degree of parallelism is low. This degree of parallelism can be usually increased, as 
suggested by the expert system, by breaking the entry points into multiple entry points which could 
be executed in any order. However, this did not seem to be possible because of the dependences 
present in the problem. We have already split the dependences within each direction into different 
entry methods. However, the computations could be performed upon the arrival of these messages 
from all direction. Therefore, all of those entry methods will have to be on the critical path. Thus 
we concluded that the available degree of parallelism within the problem was very low and further 
optimizations were infeasible. 

6 Conclusions 

We have implemented the CFD kernels in the NAS Parallel Benchmarks using Charm+ + , a parallel 
object-oriented language. In order to simplify expression of multi-dimensional parallel arrays in 
Charm+ + , we implemented a parallel array abstractions using facilities provided by Charm++ and 
its runtime system, Converse. We have shown that the abstractions developed using Charm+J- 
are indeed reusable by implementing both the Scalar Pentadiagonal (SP) and Lower-Upper Trian- 
gulation (LU) benchmarks without any modifications to the abstractions. The higher level array 
abstraction allowed us to experiment with many placement strategies and load- balancing without 
any significant programming overhead. Also, the benchmark code was developed such that the 
communication harness could be reused efficiently by plugging in different code for local computa- 
tions. We have presented the performance results for both the codes using different placement and 
decomposition strategies. However, the main objective of this project was not to demonstrate the 
performance of the code but to demonstrate the ease of programming and experimentation using 
abstractions in parallel object-oriented languages such as Charm+ff. The later was demonstrated 
by the parallel array abstraction that made it possible to switch between different placement and 
domain decomposition strategies by merely providing a different mapping function at array creation 
time. 


A Pseudo-Code for LU Benchmark 

Figure 5 shows pseudo-code for the cube chare in the LU benchmark. This pseudo-code is written 
using a notation called Structured Dagger [5], which is a coordination language built on top on 
Charm-f + . 
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chare cube 

{ 

//chare-local variables declarations 

structentry iterations: (InitMessage *raessage) 

{ 

atomic { 

Initialization ) ; 

// Send Startup Messages Containing Boundary Elements 
// To Neighbors * rhs.entry points 

} 

while (iter<raaxiter) { 

atomic { rhs_init(); } 
overlap { 

when rhs_entry_xral (Bdry *xml) ,rhsjentry_xpl(Bdry *xpl) { 
rhs_x(xml , xpl) ; } 

when rhs_entry_yral(Bdry *yral) ,rhsjentry_ypl(Bdry *ypl) { 
rhs_y (yral , ypl) ;} 

when rhs_entryjzral(Bdry *zral ) ,rhs jentry_zpl(Bdry *zpl) { 
rhs_z(zml ,zpl) ; } 
if(x==0 ftft y==0 && z— 0) { 

// Start the first sweep By sending messages to +1 neighbors 

} 

when XmBdry(Bdry *xramsg) , YmBdry (Bdry *yramsg) ,ZmBdry(Bdry *zmmsg) { 
atomic { 

bits ( ) ; jacld( ) ; 

// Continue the sweep by sending messages to +1 neighbors ' 

// XmBdry, YmBdry, ZmBdry entry-points 

} 

} 

if(x==raaxx && y==maxy && z—raaxz) { 

// Start the reverse sweep by sending messages to -1 neighbors 

} 

when XpBdry(Bdry *xrasg) , YpBdry (Bdry *ymsg) , ZpBdry (Bdry *zmsg) { 
atomic { 

buts( ) ; jacu( ) ; 

//Continue reverse sweep by sending messages to -1 neighbors’ 
// XpBdry, YpBdry, ZpBdry entry-points 
updateu( ) ; 

// Send updated boundaries to neighbors’ rhs_entry points 
iter++ ; 

} 

} 

} 

Figure 5: LU Benchmark Program 
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delete m ; 

NumPhilz++ ; 

if ( NumPhilz < XArraySize*YArraySi2e ) 

return ; 
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void EndADI_KSync(SomeMsg *m) ; 

message class NormsMsg { 

public: double sum [5] ; void FinishedJsweepRemap(SomeMsg *m) ; 
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BoundaryMsg *m = GetBoundary(JPREV,U.ONLY) ; 
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BoundaryMsg *m2 = GetBoundary(KPREV,RSDi)NLY) ; 
SendArray(cubeaiTay, .CKjep_Cube_ADIback_Ksweep, m2, myx, myy, my2-l) ; 



SomeMsg *m2 = new SoraeMsg ; 
controiboc[0]->EndADI_KSync(in2) ; 
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/*** These are the pack and unpack functions for the cube chare 
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rsd { 5, isizl , isiz2 , isiz3) , 
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read (5,*) tolrsd (1) , tolrsd ( 2 ) , tzl = 1.0d+00 / ( dzeta * dzeta ) 

$ tolrsd (3) , tolrsd (4) , tolrsd (5) tz2 = 1.0d+00 / ( 2.0d+00 * dzeta 

tz3 = 1.0d+00 / dzeta 

read problem specification parameters c 
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common/ cvar/ u( 5 , 15121 , 13122 , 15123 ) , 

> rsd (5, isizl, isiz2, isiz3) , 

• fret ( 5 , isizl , isiz2 , isiz3 ) 
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common /cvar/ u{5, isizl, isiz2,isiz3) , 

rsd ( 5, isizl , isiz2 , isiz3 ) , 
; fret (5 , isizl , isiz2 , isiz3 ) 




